06 Resumen Regresión Lineal

Author

Eddie Aguilar

library(easypackages)
libraries("tidyverse","fpp3", "patchwork","plotly")

Estos modelos se basan en encontrar relaciones lineales entre la serie a pronosticar y una o más series distintas. Por lo tanto, pronosticamos los valores de una serie, a partir de los cambios en otra serie que la afecte.

Ejemplo: La venta de helados tiene una relación con la temperatura, las ventas de Nike tiene relación con cuanto gastan en publicidad y mercadotecnia.

Por lo tanto, tenemos una variable de pronóstico y variables predictoras.

Modelo lineal

Modelo de regresión lineal simple:

\[y_t = \beta_0 + \beta_1x_t + e_t\]

donde:

  • \(\beta_0\) es intercepto y representa el valor predicho cuando x = 0.
  • \(\beta_1\) es la pendiente de la recta. Nos indica el cambio promedio en y, ante un cambio en una unidad de x.
  • \(e_t\) es el error, se asume aleatorio y captura todos los cambios de otras variables que también pueden afectar a \(y_t\) pero no se especifican.

Por lo tanto:

Ejemplo

Tasas de crecimiento del gasto de consumo (y) y su relación con el ingreso personal disponible (x)

us_change %>%
  ggplot(aes(x = Quarter)) +
  geom_line(aes(y = Consumption, colour = "Consumo")) +
  geom_line(aes(y = Income, colour = "Ingreso")) +
  ylab("cambio %") + xlab("Año") +
  guides(colour=guide_legend(title="Series")) + 
  theme(legend.position = "top")

Diagrama de dispersión entre ambas series:

us_change %>%
  ggplot(aes(x=Income, y=Consumption)) +
    ylab("Consumo (cambio % trimestral)") +
    xlab("Ingreso (cambio % trimestral)") +
    geom_point() +
    geom_smooth(method="lm", se=FALSE)
`geom_smooth()` using formula = 'y ~ x'

La línea sigue la ecuación de la regresión lineal.

Significado

El análisis de regresión en tiempos modernos trata sobre la relación de la dependencia entre una variable y, respecto de una o más variables exógenas (regresoras x) para predecir el valor promedio de la variable dependiente.

Al hablar de funciones lineales, podemos referirnos a las variables x o a los parámetros \(\beta\). En el caso de un modelo de regresión lineal, solo nos interesa la linealidad en los parámetros.

Por lo tanto, las funciones de y pueden ser curvas o cualquier forma (exponencial cuadrática, cúbica) y seguirán siendo lineales en los parámetros y se pueden estimar con un modelo de regresión lineal.

Entonces, un modelo de regresión lineal puede generar una recta, curva dependiendo de la forma funcional.

Ejemplos

US % change

\[y_{consumo} = \beta_0 + \beta_1x_{income} +\beta_2x_{production}+\beta_3x_{savings}+\beta_4x_{unemployment} + e_t\]

us_change
# A tsibble: 198 x 6 [1Q]
   Quarter Consumption Income Production Savings Unemployment
     <qtr>       <dbl>  <dbl>      <dbl>   <dbl>        <dbl>
 1 1970 Q1       0.619  1.04      -2.45    5.30         0.9  
 2 1970 Q2       0.452  1.23      -0.551   7.79         0.5  
 3 1970 Q3       0.873  1.59      -0.359   7.40         0.5  
 4 1970 Q4      -0.272 -0.240     -2.19    1.17         0.700
 5 1971 Q1       1.90   1.98       1.91    3.54        -0.100
 6 1971 Q2       0.915  1.45       0.902   5.87        -0.100
 7 1971 Q3       0.794  0.521      0.308  -0.406        0.100
 8 1971 Q4       1.65   1.16       2.29   -1.49         0    
 9 1972 Q1       1.31   0.457      4.15   -4.29        -0.200
10 1972 Q2       1.89   1.03       1.89   -4.69        -0.100
# … with 188 more rows
us_change %>% 
  as_tibble() %>% 
  select(-Quarter) %>% 
  GGally::ggpairs()
Registered S3 method overwritten by 'GGally':
  method from   
  +.gg   ggplot2

us_change %>% 
  pivot_longer(cols = -Quarter) %>% 
  ggplot(aes(x = Quarter, y = value, color = name)) +
  geom_line() +
  facet_wrap(~ name, scales = "free_y") +
  theme(legend.position = "none")

Comparando la variable de pronóstico (Consumo) vs las demás variables predictoras:

us_change %>% 
  pivot_longer(cols = -c(Quarter, Consumption)) %>% 
  ggplot(aes(x = Quarter, y = value, color = name)) +
  geom_line() +
  geom_line(aes(y = Consumption), color = "black") +
  facet_wrap(~ name, scales = "free_y") +
  theme(legend.position = "none")

Regresión lineal simple

Aplicando un primer modelo, usando la variable predictora del ingreso disponible.

fit1 <- us_change %>% 
  model(reg_lin_simple = TSLM(Consumption ~ Income)
        )
fit1 %>%  report()
Series: Consumption 
Model: TSLM 

Residuals:
     Min       1Q   Median       3Q      Max 
-2.58236 -0.27777  0.01862  0.32330  1.42229 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.54454    0.05403  10.079  < 2e-16 ***
Income       0.27183    0.04673   5.817  2.4e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.5905 on 196 degrees of freedom
Multiple R-squared: 0.1472, Adjusted R-squared: 0.1429
F-statistic: 33.84 on 1 and 196 DF, p-value: 2.4022e-08
augment(fit1) %>% 
  ggplot(aes(x = Quarter)) +
  geom_line(aes(y = Consumption, color = "Datos")) +
  geom_line(aes(y = .fitted, color = "Fitted"))+
  xlab("Año") + ylab(NULL) +
  ggtitle("Cambios porcentuales en el gasto de Consumo en EEUU") +
  guides(color = guide_legend(title = NULL))

No parece ajustarse bien.

augment(fit1) %>% 
  ggplot(aes(x = Consumption, y = .fitted)) +
  geom_point() +
  ylab("Fitted (valores ajustados)") +
  xlab("Datos (reales históricos)") +
  ggtitle("Cambios porcentuales en el gasto de Consumo en EEUU") +
  geom_abline(intercept = 0, slope = 1)

fit1 %>% 
  gg_tsresiduals()

augment(fit1) %>% 
  features(.resid, ljung_box, lag= 10, dof = 2)
# A tibble: 1 × 3
  .model         lb_stat lb_pvalue
  <chr>            <dbl>     <dbl>
1 reg_lin_simple    35.1 0.0000259

Se puede mejorar bastante.

Regresión lineal múltiple

\[y_{consumo} = \beta_0 + \beta_1x_{income} +\beta_2x_{production}+\beta_3x_{savings}+\beta_4x_{unemployment} + e_t\]

fit2 <- us_change %>% 
  model(
    reg_lin_multiple = TSLM(Consumption ~ Income + Production + Savings + Unemployment)
  )
report(fit2)
Series: Consumption 
Model: TSLM 

Residuals:
     Min       1Q   Median       3Q      Max 
-0.90555 -0.15821 -0.03608  0.13618  1.15471 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)   0.253105   0.034470   7.343 5.71e-12 ***
Income        0.740583   0.040115  18.461  < 2e-16 ***
Production    0.047173   0.023142   2.038   0.0429 *  
Savings      -0.052890   0.002924 -18.088  < 2e-16 ***
Unemployment -0.174685   0.095511  -1.829   0.0689 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3102 on 193 degrees of freedom
Multiple R-squared: 0.7683, Adjusted R-squared: 0.7635
F-statistic:   160 on 4 and 193 DF, p-value: < 2.22e-16
augment(fit2) %>% 
  ggplot(aes(x = Quarter)) +
  geom_line(aes(y = Consumption, color = "Datos")) +
  geom_line(aes(y = .fitted, color = "Fitted"))+
  xlab("Año") + ylab(NULL) +
  ggtitle("Cambios porcentuales en el gasto de Consumo en EEUU") +
  guides(color = guide_legend(title = NULL))

augment(fit2) %>% 
  ggplot(aes(x = Consumption, y = .fitted)) +
  geom_point() +
  ylab("Fitted (valores ajustados)") +
  xlab("Datos (reales históricos)") +
  ggtitle("Cambios porcentuales en el gasto de Consumo en EEUU") +
  geom_abline(intercept = 0, slope = 1)

fit2 %>% 
  gg_tsresiduals()

augment(fit2) %>% 
  features(.resid, ljung_box, lag= 10, dof = 2)
# A tibble: 1 × 3
  .model           lb_stat lb_pvalue
  <chr>              <dbl>     <dbl>
1 reg_lin_multiple    18.9    0.0156
df <- left_join(us_change, residuals(fit2), by = "Quarter")
df %>% 
  select(-c(Consumption, .model)) %>% 
  pivot_longer(cols = c(Income:Unemployment)) %>% 
  ggplot(aes( x = value, y = .resid, color = name)) + 
  geom_point() + ylab("Residuales") + xlab("Predictoras") +
  facet_wrap(~ name, scales = "free_x") +
  theme(legend.position = "none")

augment(fit2) %>% 
  ggplot(aes(x = .fitted, y = .resid)) +
  geom_point() +
  labs(x = "Ajustados", y = "Residuales")

glance(fit2) %>% 
  select(adj_r_squared, AIC, AICc, BIC)
# A tibble: 1 × 4
  adj_r_squared   AIC  AICc   BIC
          <dbl> <dbl> <dbl> <dbl>
1         0.763 -457. -456. -437.

Selección de predictoras

Escoger subconjunto de predictoras y probarlo

fit3 <- us_change %>% 
  model(r1 = TSLM(Consumption ~ Income),
        r2 = TSLM(Consumption ~ Income + Production),
        r3 = TSLM(Consumption ~ Income + Production + Savings + Unemployment),
        r4 = TSLM(Consumption ~ Income + Production + Savings),
        r5 = TSLM(Consumption ~ Income + Savings + Unemployment),
        r6 = TSLM(Consumption ~ Income + Production + Unemployment),
        r7 = TSLM(Consumption ~ Income + Savings)
        )
fit3 %>% 
  glance() %>% 
  select(.model, adj_r_squared, AIC, AICc, BIC)
# A tibble: 7 × 5
  .model adj_r_squared   AIC  AICc   BIC
  <chr>          <dbl> <dbl> <dbl> <dbl>
1 r1             0.143 -205. -204. -195.
2 r2             0.336 -254. -254. -241.
3 r3             0.763 -457. -456. -437.
4 r4             0.761 -455. -455. -439.
5 r5             0.760 -454. -454. -438.
6 r6             0.366 -262. -262. -246.
7 r7             0.735 -436. -436. -423.
fit3 %>% 
  select(r3) %>% 
  report()
Series: Consumption 
Model: TSLM 

Residuals:
     Min       1Q   Median       3Q      Max 
-0.90555 -0.15821 -0.03608  0.13618  1.15471 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)   0.253105   0.034470   7.343 5.71e-12 ***
Income        0.740583   0.040115  18.461  < 2e-16 ***
Production    0.047173   0.023142   2.038   0.0429 *  
Savings      -0.052890   0.002924 -18.088  < 2e-16 ***
Unemployment -0.174685   0.095511  -1.829   0.0689 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3102 on 193 degrees of freedom
Multiple R-squared: 0.7683, Adjusted R-squared: 0.7635
F-statistic:   160 on 4 and 193 DF, p-value: < 2.22e-16

Backwards stepwise regression

  • Se epmieza con un modelo con todas las predictoras.
  • Se quita una a la vez.
  • Se mantiene el modelo dependiendo de si hay mejora o no.
us_change %>% 
  model(i = TSLM(Consumption ~ Income + Production + Savings + Unemployment),
        ii = TSLM(Consumption ~ Income + Production + Savings),
        iii = TSLM(Consumption ~ Income + Production + Unemployment),
        iv = TSLM(Consumption ~ Income + Savings + Unemployment),
        v = TSLM(Consumption ~ Production + Savings + Unemployment)
        ) %>% 
  glance() %>% 
  select(.model, adj_r_squared, AIC, AICc, BIC)
# A tibble: 5 × 5
  .model adj_r_squared   AIC  AICc   BIC
  <chr>          <dbl> <dbl> <dbl> <dbl>
1 i              0.763 -457. -456. -437.
2 ii             0.761 -455. -455. -439.
3 iii            0.366 -262. -262. -246.
4 iv             0.760 -454. -454. -438.
5 v              0.349 -257. -257. -241.

Con base a AICc es mejor el que incluye todas las predictoras.

Con base a BIC es mejor el que no incluye al desempleo.

us_change %>% 
  model(i = TSLM(Consumption ~ Income + Production + Savings + Unemployment),
        ii = TSLM(Consumption ~ Income + Production + Savings),
        iii = TSLM(Consumption ~ Income + Production),
        iv = TSLM(Consumption ~ Income + Savings),
        v = TSLM(Consumption ~ Production + Savings),
        vi = TSLM(Consumption ~ Income + Savings + Unemployment),
        vii = TSLM(Consumption ~ Production + Savings + Unemployment),
        viii = TSLM(Consumption ~ Income + Production + Unemployment)
        ) %>% 
  glance() %>% 
  select(.model, adj_r_squared, AIC, AICc, BIC)
# A tibble: 8 × 5
  .model adj_r_squared   AIC  AICc   BIC
  <chr>          <dbl> <dbl> <dbl> <dbl>
1 i              0.763 -457. -456. -437.
2 ii             0.761 -455. -455. -439.
3 iii            0.336 -254. -254. -241.
4 iv             0.735 -436. -436. -423.
5 v              0.324 -251. -250. -238.
6 vi             0.760 -454. -454. -438.
7 vii            0.349 -257. -257. -241.
8 viii           0.366 -262. -262. -246.

Forwards stepwise regression

  • Comenzar con un modelo simple con solo el intercepto.
  • Ir añadiendo predictoras.
  • Elegir solo la que más mejore el modelo.
  • Iterar.
us_change %>% 
  model(i = TSLM(Consumption ~ Income),
        ii = TSLM(Consumption ~ Production),
        iii = TSLM(Consumption ~ Savings),
        iv = TSLM(Consumption ~ Unemployment)
        ) %>% 
  glance() %>% 
  select(.model, adj_r_squared, AIC, AICc, BIC)
# A tibble: 4 × 5
  .model adj_r_squared   AIC  AICc   BIC
  <chr>          <dbl> <dbl> <dbl> <dbl>
1 i             0.143  -205. -204. -195.
2 ii            0.276  -238. -238. -228.
3 iii           0.0611 -187. -186. -177.
4 iv            0.274  -237. -237. -228.
us_change %>% 
  model(
    o   = TSLM(Consumption ~ Production),
    i   = TSLM(Consumption ~ Production + Income),
    ii  = TSLM(Consumption ~ Production + Savings),
    iii = TSLM(Consumption ~ Production + Unemployment)
  ) %>% 
  glance() %>% 
  select(.model, adj_r_squared, AIC, AICc, BIC)
# A tibble: 4 × 5
  .model adj_r_squared   AIC  AICc   BIC
  <chr>          <dbl> <dbl> <dbl> <dbl>
1 o              0.276 -238. -238. -228.
2 i              0.336 -254. -254. -241.
3 ii             0.324 -251. -250. -238.
4 iii            0.308 -246. -246. -233.
us_change %>% 
  model(
    i   = TSLM(Consumption ~ Production + Income),
    ii  = TSLM(Consumption ~ Production + Income + Savings),
    iii = TSLM(Consumption ~ Production + Income + Unemployment)
  ) %>% 
  glance() %>% 
  select(.model, adj_r_squared, AIC, AICc, BIC)
# A tibble: 3 × 5
  .model adj_r_squared   AIC  AICc   BIC
  <chr>          <dbl> <dbl> <dbl> <dbl>
1 i              0.336 -254. -254. -241.
2 ii             0.761 -455. -455. -439.
3 iii            0.366 -262. -262. -246.
us_change %>% 
  model(
    i   = TSLM(Consumption ~ Production + Income + Savings),
    ii  = TSLM(Consumption ~ Production + Income + Savings + Unemployment)
  ) %>% 
  glance() %>% 
  select(.model, adj_r_squared, AIC, AICc, BIC)
# A tibble: 2 × 5
  .model adj_r_squared   AIC  AICc   BIC
  <chr>          <dbl> <dbl> <dbl> <dbl>
1 i              0.761 -455. -455. -439.
2 ii             0.763 -457. -456. -437.

Pronóstico

Pronósticos ex-ante

Solo se utiliza información disponible hasta el último dato del histórico. Se tiene que pronosticar las predictoras antes de poder pronosticar la variable de interés.

# mod_predictoras <- function(predictora, horizonte = 4) {
#   us_change %>% 
#     model(predictora = ARIMA(as.formula(predictora)) %>% 
#     forecast(h = horizonte)
# }
# 
# mod_predictoras(predictora = Income)

ingreso <-  us_change %>% 
  model(ETS = ETS(Income),
        ARIMA = ARIMA(Income)
        ) %>% 
  forecast(h = 4) 

ingreso %>% 
  autoplot(us_change, level = NULL)

produccion <- us_change %>% 
  model(
    ETS = ETS(Production),
    ARIMA = ARIMA(Production)
  ) %>% 
  forecast(h = 4)
produccion %>% 
  autoplot(us_change, level = NULL)

ahorro <- us_change %>% 
  model(
    ETS = ETS(Savings),
    ARIMA = ARIMA(Savings)
  ) %>% 
  forecast(h = 4)
ahorro %>% 
  autoplot(us_change, level = NULL)

desempleo <- us_change %>% 
  model(
    ETS = ETS(Unemployment),
    ARIMA = ARIMA(Unemployment)
  ) %>% 
  forecast(h = 4)
desempleo %>% 
  autoplot(us_change, level = NULL)

Teniendo los pronósticos de las predictoras, generamos el pronóstico del consumo.

fit <- us_change %>% 
  model(
    `Regresión lineal múltiple` = TSLM(Consumption ~ Income + Production + Savings + Unemployment)
  )

datos_futuros <- new_data(us_change,4) %>% 
  mutate(Income = ingreso %>% filter(.model == "ARIMA") %>% pull(.mean), 
         Savings = ahorro %>% filter(.model == "ARIMA") %>% pull(.mean), 
         Unemployment = desempleo %>% filter(.model == "ARIMA") %>% pull(.mean),
         Production = produccion %>% filter(.model == "ARIMA") %>% pull(.mean))

datos_futuros
# A tsibble: 4 x 5 [1Q]
  Quarter Income Savings Unemployment Production
    <qtr>  <dbl>   <dbl>        <dbl>      <dbl>
1 2019 Q3  0.813   3.34       0.00461    0.131  
2 2019 Q4  0.715   0.725     -0.0427    -0.00325
3 2020 Q1  0.729   1.62       0.0248     0.565  
4 2020 Q2  0.729   1.32       0.0307     0.345  
fc <- forecast(fit, datos_futuros)

fc %>% 
  autoplot(us_change)

fc %>% 
  autoplot(us_change %>% filter_index("2016 Q1" ~ .))

Pronósticos ex-post

Se utiliza información real disponible de las predictoras. Estos pronósticos ya no son reales.

Pronósticos basados en escenarios

fit_escenarios <- us_change %>% 
  model(lineal = TSLM(Consumption ~ Income + Savings + Unemployment))

# Necesitamos agregar nuevos datos de las predictoras
escenarios <- scenarios(
  optimista = new_data(us_change, 4) %>% 
    mutate(Income = c(0,0.2,0,1.2), Savings = c(0,-0.1,0.5,-1), Unemployment = -0.1),
  pesimista = new_data(us_change, 4) %>% 
    mutate(Income = -1, Savings = -0.5, Unemployment = 0.1),
  names_to = "Escenario"
)

fc_escenarios <- fit_escenarios %>% 
  forecast(new_data = escenarios)

us_change %>% 
  autoplot(Consumption) +
  autolayer(fc_escenarios)

Inclusión de predictoras útiles

Para la producción de cerveza:

recent_production <- aus_production %>% 
  filter(year(Quarter) >= 1992)

recent_production
# A tsibble: 74 x 7 [1Q]
   Quarter  Beer Tobacco Bricks Cement Electricity   Gas
     <qtr> <dbl>   <dbl>  <dbl>  <dbl>       <dbl> <dbl>
 1 1992 Q1   443    5777    383   1289       38332   117
 2 1992 Q2   410    5853    404   1501       39774   151
 3 1992 Q3   420    6416    446   1539       42246   175
 4 1992 Q4   532    5825    420   1568       38498   129
 5 1993 Q1   433    5724    394   1450       39460   116
 6 1993 Q2   421    6036    462   1668       41356   149
 7 1993 Q3   410    6570    475   1648       42949   163
 8 1993 Q4   512    5675    443   1863       40974   138
 9 1994 Q1   449    5311    421   1468       40162   127
10 1994 Q2   381    5717    475   1755       41199   159
# … with 64 more rows
recent_production %>% 
  autoplot(Beer) +
  labs(x = "Año", y = "Megalitros", 
       title = "Producción de cerveza trimestral en Australia")

Hay predictoras que pueden ser útiles en el análisis de regresión.

  1. Predictora de tendencia trend()

\[y_t = \beta_0 + \beta_1t+e_t\] 2. Variables dummy estacionales

Las variables dummy toman solo dos valores: 1 o 0.

Para agregar variables dummy, solo tiene que escriibr season()

recent_production %>% 
  select(Quarter,Beer) %>% 
  mutate(tendencia = seq_along(recent_production$Quarter),
         q2 = if_else(quarter(Quarter)==2,1,0),
         q3 = if_else(quarter(Quarter)==3,1,0),
         q4 = if_else(quarter(Quarter)==4,1,0)
         ) %>% 
  model(TSLM(Beer ~ tendencia + q2 + q3 + q4)) %>% 
  report()
Series: Beer 
Model: TSLM 

Residuals:
     Min       1Q   Median       3Q      Max 
-42.9029  -7.5995  -0.4594   7.9908  21.7895 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 441.80044    3.73353 118.333  < 2e-16 ***
tendencia    -0.34027    0.06657  -5.111 2.73e-06 ***
q2          -34.65973    3.96832  -8.734 9.10e-13 ***
q3          -17.82164    4.02249  -4.430 3.45e-05 ***
q4           72.79641    4.02305  18.095  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 12.23 on 69 degrees of freedom
Multiple R-squared: 0.9243, Adjusted R-squared: 0.9199
F-statistic: 210.7 on 4 and 69 DF, p-value: < 2.22e-16
fit_beer <- recent_production %>% 
  model(TSLM(Beer ~ trend() + season()))

report(fit_beer)
Series: Beer 
Model: TSLM 

Residuals:
     Min       1Q   Median       3Q      Max 
-42.9029  -7.5995  -0.4594   7.9908  21.7895 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)   441.80044    3.73353 118.333  < 2e-16 ***
trend()        -0.34027    0.06657  -5.111 2.73e-06 ***
season()year2 -34.65973    3.96832  -8.734 9.10e-13 ***
season()year3 -17.82164    4.02249  -4.430 3.45e-05 ***
season()year4  72.79641    4.02305  18.095  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 12.23 on 69 degrees of freedom
Multiple R-squared: 0.9243, Adjusted R-squared: 0.9199
F-statistic: 210.7 on 4 and 69 DF, p-value: < 2.22e-16
p <- augment(fit_beer) %>% 
  ggplot(aes(x = Quarter)) +
  geom_line(aes(y = Beer, color = "Datos")) +
  geom_line(aes(y = .fitted, color = "Fitted")) +
  labs(x = "Año", y = "Megalitros")

ggplotly(p)

Hay un trimestre que está por debajo, se puede agregar una variable de intervención para ese periodo.

Spike variables

Capturan el efecto de un solo periodo

cerveza <- recent_production %>% 
  select(Quarter, Beer) %>% 
  mutate(
    q2_94 = if_else(Quarter == yearquarter("1994 Q2"),1,0),
    q4_04 = if_else(Quarter == yearquarter("2004 Q4"),1,0)
  )
cerveza
# A tsibble: 74 x 4 [1Q]
   Quarter  Beer q2_94 q4_04
     <qtr> <dbl> <dbl> <dbl>
 1 1992 Q1   443     0     0
 2 1992 Q2   410     0     0
 3 1992 Q3   420     0     0
 4 1992 Q4   532     0     0
 5 1993 Q1   433     0     0
 6 1993 Q2   421     0     0
 7 1993 Q3   410     0     0
 8 1993 Q4   512     0     0
 9 1994 Q1   449     0     0
10 1994 Q2   381     1     0
# … with 64 more rows
fit_beer <- cerveza %>% 
  model(TSLM(Beer ~ trend() + season() + q2_94 + q4_04))

report(fit_beer)
Series: Beer 
Model: TSLM 

Residuals:
     Min       1Q   Median       3Q      Max 
-27.0379  -5.8811  -0.6895   7.7209  21.7895 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)   441.84123    3.32189 133.009  < 2e-16 ***
trend()        -0.34137    0.05975  -5.713 2.78e-07 ***
season()year2 -33.39369    3.55788  -9.386 7.83e-14 ***
season()year3 -17.82164    3.55460  -5.014 4.16e-06 ***
season()year4  75.32030    3.60790  20.876  < 2e-16 ***
q2_94         -24.03384   11.24265  -2.138 0.036190 *  
q4_04         -45.41027   11.15547  -4.071 0.000126 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 10.81 on 67 degrees of freedom
Multiple R-squared: 0.9426, Adjusted R-squared: 0.9375
F-statistic: 183.4 on 6 and 67 DF, p-value: < 2.22e-16
p <- augment(fit_beer) %>% 
  ggplot(aes(x = Quarter)) +
  geom_line(aes(y = Beer, color = "Datos")) +
  geom_line(aes(y = .fitted, color = "Fitted")) +
  labs(x = "Año", y = "Megalitros")

ggplotly(p)

Cambios de nivel

Capturan el efecto a partir de cierto periodo

cerveza <- cerveza %>% 
  mutate(d2000 = if_else(year(Quarter)>=2000,1,0))
cerveza
# A tsibble: 74 x 5 [1Q]
   Quarter  Beer q2_94 q4_04 d2000
     <qtr> <dbl> <dbl> <dbl> <dbl>
 1 1992 Q1   443     0     0     0
 2 1992 Q2   410     0     0     0
 3 1992 Q3   420     0     0     0
 4 1992 Q4   532     0     0     0
 5 1993 Q1   433     0     0     0
 6 1993 Q2   421     0     0     0
 7 1993 Q3   410     0     0     0
 8 1993 Q4   512     0     0     0
 9 1994 Q1   449     0     0     0
10 1994 Q2   381     1     0     0
# … with 64 more rows
fit_beer <- cerveza %>% 
  model(TSLM(Beer ~ trend() + season() + d2000))

report(fit_beer)
Series: Beer 
Model: TSLM 

Residuals:
     Min       1Q   Median       3Q      Max 
-43.0235  -7.5945  -0.4539   7.7754  21.4829 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)   441.9154     3.8644 114.354  < 2e-16 ***
trend()        -0.3548     0.1308  -2.712  0.00846 ** 
season()year2 -34.6452     3.9985  -8.665 1.36e-12 ***
season()year3 -17.8046     4.0536  -4.392 4.02e-05 ***
season()year4  72.8279     4.0594  17.941  < 2e-16 ***
d2000           0.7282     5.6402   0.129  0.89766    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 12.32 on 68 degrees of freedom
Multiple R-squared: 0.9243, Adjusted R-squared: 0.9188
F-statistic: 166.1 on 5 and 68 DF, p-value: < 2.22e-16
p <- augment(fit_beer) %>% 
  ggplot(aes(x = Quarter)) +
  geom_line(aes(y = Beer, color = "Datos")) +
  geom_line(aes(y = .fitted, color = "Fitted")) +
  labs(x = "Año", y = "Megalitros")

ggplotly(p)
cerveza %>% 
  gg_tsdisplay(Beer %>% difference(4), plot_type = "partial")
Warning: Removed 4 rows containing missing values (`geom_line()`).
Warning: Removed 4 rows containing missing values (`geom_point()`).

Regresiones no lineales

Modelos exponenciales

Modelos log.log \[logy_t = \beta_0 + \beta_1log x_t\]

Modelos lin-log \[y_t = \beta_0 + \beta_1log x_t\]

Modelos log-lin \[logy_t = \beta_0 + \beta_1x_t\]

Modelos de regresión lineal por partes (picewise)

boston_men <- boston_marathon %>% 
  filter(Event == "Men's open division") %>% 
  mutate(Minutes = as.numeric(Time)/60)

p <- boston_men %>% 
  autoplot(Minutes) + 
  ggtitle("Tiempos ganadores del maratón de Boston, categoría abierta de hombres")

ggplotly(p)
fit_boston <- boston_men %>% 
  model(
    lineal = TSLM(Minutes ~ trend()),
    )

fc_boston <- fit_boston %>% forecast(h = 10)

boston_men %>% 
  autoplot(Minutes) +
  geom_line(aes(y = .fitted, color = .model), data = fitted(fit_boston)) +
  autolayer(fc_boston, alpha = 0.5, level = 95) +
  ggtitle("Maratón de Boston, cat. abierta de hombres")

fit_boston <- boston_men %>% 
  model(
    lineal = TSLM(Minutes ~ trend()),
    exponencial = TSLM(log(Minutes) ~ trend()),
    `Reg. por partes` = TSLM(Minutes ~ trend(knots = c(1940,1980)))
  )

fc_boston <- fit_boston %>% forecast(h = 10)

boston_men %>% 
  autoplot(Minutes) +
  geom_line(aes(y = .fitted, color = .model), data = fitted(fit_boston)) +
  autolayer(fc_boston, alpha = 0.5, level = 95) +
  ggtitle("Maratón de Boston, cat. abierta de hombres")

fit_boston <- boston_men %>% 
  model(
    lineal = TSLM(Minutes ~ trend()),
    exponencial = TSLM(log(Minutes) ~ trend()),
    `Reg. por partes` = TSLM(Minutes ~ trend(knots = c(1940,1980))),
    `Reg. por partes2` = TSLM(Minutes ~ trend(knots = c(1975))),
    `Reg. por partes3` = TSLM(Minutes ~ trend(knots = c(1915, 1950, 1988))),
    `Reg. por partes4` = TSLM(Minutes ~ trend(knots = c(1915, 1927, 1950, 1985)))
  )

fc_boston <- fit_boston %>% forecast(h = 10)

boston_men %>% 
  autoplot(Minutes) +
  geom_line(aes(y = .fitted, color = .model), data = fitted(fit_boston)) +
  autolayer(fc_boston, alpha = 0.5, level = 95) +
  ggtitle("Maratón de Boston, cat. abierta de hombres")

accuracy(fit_boston) %>% 
  arrange(MAPE)
# A tibble: 6 × 11
  Event      .model .type        ME  RMSE   MAE     MPE  MAPE  MASE RMSSE   ACF1
  <fct>      <chr>  <chr>     <dbl> <dbl> <dbl>   <dbl> <dbl> <dbl> <dbl>  <dbl>
1 Men's ope… Reg. … Trai… -9.24e-16  4.64  3.18 -0.0958  2.19 0.727 0.716 0.0323
2 Men's ope… Reg. … Trai… -3.46e-16  5.12  3.67 -0.116   2.53 0.838 0.789 0.196 
3 Men's ope… Reg. … Trai…  9.24e-16  5.68  4.00 -0.141   2.72 0.913 0.876 0.312 
4 Men's ope… Reg. … Trai…  6.93e-16  6.01  4.57 -0.164   3.15 1.04  0.928 0.402 
5 Men's ope… expon… Trai…  1.26e- 1  6.10  4.76 -0.0845  3.29 1.09  0.941 0.415 
6 Men's ope… lineal Trai… -4.62e-16  6.13  4.79 -0.169   3.31 1.09  0.945 0.417 

Box-cox

lambda <- boston_men %>% 
  features(Minutes, guerrero) %>% 
  pull(lambda_guerrero)

p1 <- boston_men %>% 
  autoplot(Minutes)

p2 <- boston_men %>% 
  autoplot(box_cox(Minutes, lambda = lambda))

p3 <- boston_men %>% 
  autoplot(log(Minutes))

p1/p2/p3